DiagrammeR::mermaid('
%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph BT
B[<b>movehubcostofliving.csv</b>] -->|City| A(<b>Combined_data</b> - contains details of all 3 datasets)
C[<b>movehubqualityoflife.csv</b>] -->|City| A
D[<b>read.csv</b>] -->|City| A
E[<b>worldcities.xlsx</b>] -->|City| D
style A fill:#f9f,stroke:#333,stroke-width:2px,font-weight:bold,font-style:italic;
')
Assignment 3: Movehub City Rankings
Introduction:
MoveHub knows that everyone has different reasons for moving. That's why they have ranked some of the world's top cities on key factors such as safety, property affordability and climate. Simply select the factors most important to you, and they will show you the best cities to consider!
For more details, visit the MoveHub website.
We have three files downloaded from Kaggle:
movehubqualityoflife.csv
movehubcostofliving.csv
cities.csv (LAT & LONG merged in from the Downloadable Geographic Databases at Simplemaps.com)
Flow Diagram of Dataset:
All three datasets are merged into a single dataset (combined_data) in a later step. The fourth dataset, which held the latitude and longitude data, was merged into cities.csv beforehand in Excel.
Flow of the process performed:
Importing important Libraries:
# Show conflict messages when attaching libraries
options(warn.conflicts = TRUE)
# Load libraries, suppressing package-version warnings
suppressWarnings({
library(tidyverse, exclude = c("annotate", "combine")) # For data manipulation and visualization
library(RColorBrewer) # For color palettes
library(tm) # For text mining and preprocessing
library(wordcloud) # For creating word clouds
library(gridExtra) # For arranging plots
library(leaflet) # For interactive maps
library(corrplot) # For visualizing correlation matrices
library(caTools) # For data splitting
library(DiagrammeR)
library(MASS, exclude = c("dplyr")) # For statistical functions
})
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading required package: NLP
Attaching package: 'NLP'
The following object is masked from 'package:ggplot2':
annotate
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
corrplot 0.92 loaded
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
# Suppress conflict messages for the remainder of the session
options(warn.conflicts = FALSE)
Loading all 3 datasets:
cost_of_living <- read.csv("E:\\Assigment3\\movehubcostofliving.csv")
quality_of_life <- read.csv("E:\\Assigment3\\movehubqualityoflife.csv")
cities <- read.csv("E:\\Assigment3\\cities.csv")
Dataset Understanding:
# Display all dataframes
head(cost_of_living)
       City Cappuccino Cinema  Wine Gasoline Avg.Rent Avg.Disposable.Income
1 Lausanne 3.15 12.59 8.40 1.32 1714.00 4266.11
2 Zurich 3.28 12.59 8.40 1.31 2378.61 4197.55
3 Geneva 2.80 12.94 10.49 1.28 2607.95 3917.72
4 Basel 3.50 11.89 7.35 1.25 1649.29 3847.76
5 Perth 2.87 11.43 10.08 0.97 2083.14 3358.55
6 Nashville 3.84 12.00 13.50 0.65 2257.14 3089.75
head(quality_of_life)
          City Movehub.Rating Purchase.Power Health.Care Pollution
1 Caracas 65.18 11.25 44.44 83.45
2 Johannesburg 84.08 53.99 59.98 47.39
3 Fortaleza 80.17 52.28 45.46 66.32
4 Saint Louis 85.25 80.40 77.29 31.33
5 Mexico City 75.07 24.28 61.76 18.95
6 Detroit 70.63 73.81 63.05 83.45
Quality.of.Life Crime.Rating
1 8.61 85.70
2 51.26 83.93
3 36.68 78.65
4 87.51 78.13
5 27.91 77.86
6 50.99 76.69
head(cities)
              City       Country     LAT      LONG
1 Oakland United States 37.7904 -122.2166
2 Oakville Canada 43.45 -79.6833
3 Oaxaca de Juárez Mexico 17.0606 -96.7253
4 Oberhausen Germany 51.4967 6.8706
5 Obihiro Japan 42.9167 143.2
6 Obninsk Russia 55.0931 36.6106
# Checking datatype
print("Datatype of 'cost_of_living' dataset:")
[1] "Datatype of 'cost_of_living' dataset:"
print(str(cost_of_living))
'data.frame': 216 obs. of 7 variables:
$ City : chr "Lausanne" "Zurich" "Geneva" "Basel" ...
$ Cappuccino : num 3.15 3.28 2.8 3.5 2.87 3.84 2.35 3.92 2.13 4.48 ...
$ Cinema : num 12.6 12.6 12.9 11.9 11.4 ...
$ Wine : num 8.4 8.4 10.49 7.35 10.08 ...
$ Gasoline : num 1.32 1.31 1.28 1.25 0.97 0.65 0.99 1.57 1.15 1.68 ...
$ Avg.Rent : num 1714 2379 2608 1649 2083 ...
$ Avg.Disposable.Income: num 4266 4198 3918 3848 3359 ...
NULL
print("Datatype of 'quality_of_life' dataset:")
[1] "Datatype of 'quality_of_life' dataset:"
print(str(quality_of_life))
'data.frame': 216 obs. of 7 variables:
$ City : chr "Caracas" "Johannesburg" "Fortaleza" "Saint Louis" ...
$ Movehub.Rating : num 65.2 84.1 80.2 85.2 75.1 ...
$ Purchase.Power : num 11.2 54 52.3 80.4 24.3 ...
$ Health.Care : num 44.4 60 45.5 77.3 61.8 ...
$ Pollution : num 83.5 47.4 66.3 31.3 18.9 ...
$ Quality.of.Life: num 8.61 51.26 36.68 87.51 27.91 ...
$ Crime.Rating : num 85.7 83.9 78.7 78.1 77.9 ...
NULL
print("Datatype of 'cities' dataset:")
[1] "Datatype of 'cities' dataset:"
print(str(cities))
'data.frame': 3543 obs. of 4 variables:
$ City : chr "Oakland" "Oakville" "Oaxaca de Juárez" "Oberhausen" ...
$ Country: chr "United States" "Canada" "Mexico" "Germany" ...
$ LAT : chr "37.7904" "43.45" "17.0606" "51.4967" ...
$ LONG : chr "-122.2166" "-79.6833" "-96.7253" "6.8706" ...
NULL
# Ensuring city names are consistent across datasets
cost_of_living$City <- tolower(cost_of_living$City)
quality_of_life$City <- tolower(quality_of_life$City)
cities$City <- tolower(cities$City)
# Display the dimensions of datasets
print("Dimension of row-columns of 'cities' dataset:")
[1] "Dimension of row-columns of 'cities' dataset:"
print(dim(cities))
[1] 3543    4
print("Dimension of row-columns of 'cost of living' dataset:")
[1] "Dimension of row-columns of 'cost of living' dataset:"
print(dim(cost_of_living))
[1] 216   7
print("Dimension of row-columns of 'quality of life' dataset:")
[1] "Dimension of row-columns of 'quality of life' dataset:"
print(dim(quality_of_life))
[1] 216   7
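Before merging, it helps to check how well the City keys overlap once normalised. A minimal sketch on toy vectors (not the real files), assuming a simple trim-and-lowercase normaliser (`norm_city` is a hypothetical helper, not part of the original code):

```r
# Toy stand-ins for the City columns of two of the datasets
col_cities <- c("Lausanne ", "Zurich", "Geneva")   # note the trailing space
qol_cities <- c("lausanne", "zurich", "oslo")

# Normalise keys: trim whitespace, then lower-case
norm_city <- function(x) tolower(trimws(x))

intersect(norm_city(col_cities), norm_city(qol_cities))  # cities that will join
setdiff(norm_city(col_cities), norm_city(qol_cities))    # cities that will be dropped
```

Trailing spaces are a common reason two apparently identical city names fail to join, which is why `trimws()` is applied before `tolower()`.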
Merging the data set based on City:
# Merge the datasets on the 'City' column
combined_data <- cost_of_living %>%
inner_join(quality_of_life, by = "City") %>%
inner_join(cities, by = "City")
# Displaying the merged dataframe
head(combined_data)
      City Cappuccino Cinema  Wine Gasoline Avg.Rent Avg.Disposable.Income
1 lausanne 3.15 12.59 8.40 1.32 1714.00 4266.11
2 geneva 2.80 12.94 10.49 1.28 2607.95 3917.72
3 basel 3.50 11.89 7.35 1.25 1649.29 3847.76
4 perth 2.87 11.43 10.08 0.97 2083.14 3358.55
5 nashville 3.84 12.00 13.50 0.65 2257.14 3089.75
6 canberra 2.35 11.42 10.08 0.99 1984.74 3023.91
Movehub.Rating Purchase.Power Health.Care Pollution Quality.of.Life
1 87.21 90.77 65.85 87.62 73.21
2 83.27 61.22 74.88 29.43 82.76
3 84.20 78.17 79.74 59.18 88.27
4 95.38 62.11 80.56 23.53 74.62
5 80.61 80.30 60.30 0.00 80.50
6 83.23 63.26 91.90 11.48 93.05
Crime.Rating Country LAT LONG
1 35.55 Switzerland 46.5198 6.6335
2 54.36 Switzerland 46.2017 6.1469
3 28.12 Switzerland 47.5547 7.5906
4 50.01 Australia -31.9559 115.8606
5 25.50 United States 36.1715 -86.7842
6 40.36 Australia -35.2931 149.1269
# Checking merged dataframe datatype
print("Datatype of 'combined_data' dataset:")
[1] "Datatype of 'combined_data' dataset:"
print(str(combined_data))
'data.frame': 201 obs. of 16 variables:
$ City : chr "lausanne" "geneva" "basel" "perth" ...
$ Cappuccino : num 3.15 2.8 3.5 2.87 3.84 2.35 3.92 2.13 4.48 2.49 ...
$ Cinema : num 12.6 12.9 11.9 11.4 12 ...
$ Wine : num 8.4 10.49 7.35 10.08 13.5 ...
$ Gasoline : num 1.32 1.28 1.25 0.97 0.65 0.99 1.57 1.15 1.68 0.95 ...
$ Avg.Rent : num 1714 2608 1649 2083 2257 ...
$ Avg.Disposable.Income: num 4266 3918 3848 3359 3090 ...
$ Movehub.Rating : num 87.2 83.3 84.2 95.4 80.6 ...
$ Purchase.Power : num 90.8 61.2 78.2 62.1 80.3 ...
$ Health.Care : num 65.8 74.9 79.7 80.6 60.3 ...
$ Pollution : num 87.6 29.4 59.2 23.5 0 ...
$ Quality.of.Life : num 73.2 82.8 88.3 74.6 80.5 ...
$ Crime.Rating : num 35.5 54.4 28.1 50 25.5 ...
$ Country : chr "Switzerland" "Switzerland" "Switzerland" "Australia" ...
$ LAT : chr "46.5198" "46.2017" "47.5547" "-31.9559" ...
$ LONG : chr "6.6335" "6.1469" "7.5906" "115.8606" ...
NULL
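The inner joins silently drop any city missing from one of the files (216 rows shrink to 201). `dplyr::anti_join` can list exactly what was lost; a minimal sketch on toy frames:

```r
library(dplyr)

a <- data.frame(City = c("lausanne", "zurich", "geneva"), x = 1:3)
b <- data.frame(City = c("lausanne", "zurich"), y = 4:5)

# Rows of `a` with no matching City in `b` -- what inner_join(a, b) would drop
dropped <- anti_join(a, b, by = "City")
dropped$City  # "geneva"
```

Running the same check against `cities` would reveal which of the 15 missing MoveHub cities lack coordinates.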
Data Preprocessing:
# Convert LAT and LONG columns to numeric if they aren't already
combined_data$LAT <- as.numeric(as.character(combined_data$LAT))
combined_data$LONG <- as.numeric(as.character(combined_data$LONG))
# Checking for null values
combined_data[combined_data == "N/A"]<-NA
colSums(is.na(combined_data))
                 City            Cappuccino                Cinema
0 0 0
Wine Gasoline Avg.Rent
0 0 0
Avg.Disposable.Income Movehub.Rating Purchase.Power
0 0 0
Health.Care Pollution Quality.of.Life
0 0 0
Crime.Rating Country LAT
0 0 0
LONG
0
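One thing to keep in mind with this conversion: `as.numeric()` silently turns unparseable strings into `NA` (with a coercion warning), so counting parse failures after the conversion is a cheap safeguard. A small sketch:

```r
raw <- c("46.5198", "-31.9559", "N/A", "")

# Non-numeric strings become NA (with a coercion warning)
num <- suppressWarnings(as.numeric(raw))

sum(is.na(num))  # 2 values failed to parse
```

Here the zero NA counts above confirm that every LAT/LONG string parsed cleanly.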
# Define numerical and categorical columns
numerical_columns <- names(combined_data)[sapply(combined_data, is.numeric)]
categorical_columns <- names(combined_data)[sapply(combined_data, function(x) is.factor(x) | is.character(x))]
# Summary statistics
summary(combined_data)
     City            Cappuccino        Cinema            Wine
Length:201 Min. :0.460 Min. : 1.810 Min. : 2.130
Class :character 1st Qu.:1.310 1st Qu.: 4.270 1st Qu.: 4.260
Mode :character Median :2.040 Median : 6.820 Median : 6.530
Mean :1.966 Mean : 6.829 Mean : 7.103
3rd Qu.:2.490 3rd Qu.: 7.970 3rd Qu.: 8.400
Max. :4.480 Max. :79.490 Max. :26.150
Gasoline Avg.Rent Avg.Disposable.Income Movehub.Rating
Min. :0.070 Min. : 120.7 Min. : 120.7 Min. : 59.88
1st Qu.:0.800 1st Qu.: 605.6 1st Qu.: 556.7 1st Qu.: 75.32
Median :1.000 Median : 980.6 Median :1460.5 Median : 80.85
Mean :1.025 Mean :1089.0 Mean :1384.9 Mean : 79.72
3rd Qu.:1.350 3rd Qu.:1382.3 3rd Qu.:2013.6 3rd Qu.: 83.55
Max. :1.690 Max. :5052.3 Max. :4266.1 Max. :100.00
Purchase.Power Health.Care Pollution Quality.of.Life
Min. : 6.38 Min. :20.83 Min. : 0.00 Min. : 5.29
1st Qu.:28.83 1st Qu.:58.90 1st Qu.:24.54 1st Qu.:42.65
Median :47.98 Median :67.88 Median :39.16 Median :62.82
Mean :45.41 Mean :66.31 Mean :46.25 Mean :58.76
3rd Qu.:58.77 3rd Qu.:77.31 3rd Qu.:68.21 3rd Qu.:77.07
Max. :91.85 Max. :95.96 Max. :92.42 Max. :93.05
Crime.Rating Country LAT LONG
Min. :10.86 Length:201 Min. :-43.53 Min. :-157.85
1st Qu.:28.91 Class :character 1st Qu.: 22.30 1st Qu.: -9.15
Median :41.32 Mode :character Median : 40.40 Median : 10.00
Mean :41.38 Mean : 30.38 Mean : 10.50
3rd Qu.:50.38 3rd Qu.: 48.78 3rd Qu.: 46.72
Max. :85.70 Max. : 63.43 Max. : 174.78
Exploratory Data Analysis:
1. Count of Unique Values in Each Categorical Column:
# Count of Unique Values in Each Categorical Column
unique_counts <- sapply(combined_data[sapply(combined_data, is.character)], function(x) length(unique(x)))
# Plotting the counts as a pie chart
pie(unique_counts, main = "Count of Unique Values in Each Categorical Column", col = rainbow(length(unique_counts)),
labels = paste(names(unique_counts), ": ", unique_counts), cex = 0.8)
2. Wordcloud of countries & cities in the dataset:
# Create a corpus from the "Country" column
country_corpus <- Corpus(VectorSource(combined_data$Country))
# Preprocess the text for country names (remove punctuation, convert to lowercase, remove stopwords)
suppressWarnings({
country_corpus <- tm_map(country_corpus, content_transformer(tolower))
country_corpus <- tm_map(country_corpus, removePunctuation)
country_corpus <- tm_map(country_corpus, removeNumbers)
country_corpus <- tm_map(country_corpus, removeWords, stopwords("english"))
})
# Define a custom function to remove empty documents from the corpus
removeEmptyDocs <- function(corpus) {
empty_docs <- which(sapply(corpus, function(x) length(unlist(strsplit(as.character(x), " ")))) == 0)
if (length(empty_docs) > 0) {
corpus <- corpus[-empty_docs]
}
return(corpus)
}
# Remove empty documents from the corpus
suppressWarnings({
country_corpus <- removeEmptyDocs(country_corpus)
})
# Convert corpus to a plain text document
country_text <- tm_map(country_corpus, content_transformer(as.character))
Warning in tm_map.SimpleCorpus(country_corpus,
content_transformer(as.character)): transformation drops documents
# Generate word cloud for countries with adjustments for long country names
wordcloud(country_text, min.freq = 1, random.order = FALSE, colors = brewer.pal(8, "Dark2"),
max.words = 100, scale=c(4, 0.5))
# Create a corpus from the "City" column
city_corpus <- Corpus(VectorSource(combined_data$City))
# Preprocess the text for city names (remove punctuation, convert to lowercase, remove stopwords)
suppressWarnings({
city_corpus <- tm_map(city_corpus, content_transformer(tolower))
city_corpus <- tm_map(city_corpus, removePunctuation)
city_corpus <- tm_map(city_corpus, removeNumbers)
city_corpus <- tm_map(city_corpus, removeWords, stopwords("english"))
})
# Define a custom function to remove empty documents from the corpus
removeEmptyDocs <- function(corpus) {
empty_docs <- which(sapply(corpus, function(x) length(unlist(strsplit(as.character(x), " ")))) == 0)
if (length(empty_docs) > 0) {
corpus <- corpus[-empty_docs]
}
return(corpus)
}
# Remove empty documents from the corpus
suppressWarnings({
city_corpus <- removeEmptyDocs(city_corpus)
})
# Convert corpus to a plain text document
city_text <- tm_map(city_corpus, content_transformer(as.character))
Warning in tm_map.SimpleCorpus(city_corpus, content_transformer(as.character)):
transformation drops documents
# Generate word cloud for cities with adjustments for long city names
wordcloud(city_text, min.freq = 1, random.order = FALSE, colors = brewer.pal(8, "Dark2"),
max.words = 50, scale=c(2, 0.5))
3. Histograms of the distributions of the numerical columns:
# Loop through each numerical column to create histograms
for(column_name in numerical_columns) {
# Print the name of the current column
print(column_name)
# Create a symbol from the string column name
column_sym <- sym(column_name)
# Plot histogram for the current numerical column using tidy evaluation
p <- ggplot(combined_data, aes(x = !!column_sym)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
theme_minimal() +
labs(title = paste("Histogram of", column_name))
# This will actually draw the plot
print(p)
}
[1] "Cappuccino"
[1] "Cinema"
[1] "Wine"
[1] "Gasoline"
[1] "Avg.Rent"
[1] "Avg.Disposable.Income"
[1] "Movehub.Rating"
[1] "Purchase.Power"
[1] "Health.Care"
[1] "Pollution"
[1] "Quality.of.Life"
[1] "Crime.Rating"
[1] "LAT"
[1] "LONG"
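One caveat on the loop above: the fixed `binwidth = 1` suits narrow-range columns such as Cappuccino but gives a handful of giant bars for wide-range ones such as Avg.Rent. Deriving the binwidth from each column's range is one possible fix (`adaptive_binwidth` is a hypothetical helper, not part of the original code):

```r
# Hypothetical helper: aim for ~30 bins whatever the column's scale
adaptive_binwidth <- function(x, bins = 30) {
  rng <- range(x, na.rm = TRUE)
  (rng[2] - rng[1]) / bins
}

adaptive_binwidth(c(0.46, 4.48))     # ~0.134 for Cappuccino-like values
adaptive_binwidth(c(120.7, 5052.3))  # ~164.4 for Avg.Rent-like values
```

Inside the loop this would be used as `geom_histogram(binwidth = adaptive_binwidth(combined_data[[column_name]]))`; ggplot2's own `bins = 30` argument achieves much the same effect.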
4. Are there any significant differences in Cappuccino prices across different countries?
# Filter the data to include only the top 10 countries by Cappuccino prices
top_10_countries <- combined_data %>%
group_by(Country) %>%
summarize(Avg_Cappuccino_Price = mean(Cappuccino)) %>%
top_n(10, Avg_Cappuccino_Price) %>%
pull(Country)
filtered_data <- combined_data %>%
filter(Country %in% top_10_countries)
# Create a boxplot to visualize the distribution of Cappuccino prices across the top 10 countries
ggplot(filtered_data, aes(x = Country, y = Cappuccino)) +
geom_boxplot(fill = "skyblue", color = "black") +
labs(title = "Distribution of Cappuccino Prices Across Top 10 Countries",
x = "Country",
y = "Cappuccino Price (USD)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
# Perform ANOVA to test for significant differences in Cappuccino prices across the top 10 countries
anova_result <- aov(Cappuccino ~ Country, data = filtered_data)
summary(anova_result)
            Df Sum Sq Mean Sq F value Pr(>F)
Country 9 3.526 0.3917 3.657 0.0278 *
Residuals 10 1.071 0.1071
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: Yes, the ANOVA result (p = 0.0278 < 0.05) indicates significant differences in Cappuccino prices across countries.
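The ANOVA only says that at least one country differs; `TukeyHSD()` (from base stats) identifies which pairs. A sketch on toy data, since the real `filtered_data` is not reproduced here:

```r
# Toy stand-in for filtered_data: three "countries", group C priced higher
set.seed(1)
toy <- data.frame(
  Country    = rep(c("A", "B", "C"), each = 5),
  Cappuccino = c(rnorm(5, mean = 1), rnorm(5, mean = 1), rnorm(5, mean = 3))
)

fit <- aov(Cappuccino ~ Country, data = toy)
TukeyHSD(fit)  # pairwise mean differences with adjusted p-values
```

Applied to `anova_result`, this would show which of the top-10 countries actually differ from each other, not just that a difference exists somewhere.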
5. Geographical distribution of top 10 cities with high Movehub Ratings:
# Order cities by Movehub Rating and select top 10
top_10_cities <- combined_data[order(combined_data$Movehub.Rating, decreasing = TRUE), ][1:10, ]
# Create Leaflet map
m <- leaflet(top_10_cities) %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addCircleMarkers(
lng = ~LONG, lat = ~LAT,
radius = ~sqrt(Movehub.Rating) * 3, # Adjust circle marker size based on Movehub Rating
color = "#4CAF50", fillColor = "#4CAF50", fillOpacity = 0.7,
popup = ~paste("<b>", City, "</b>", "<br>", "<b>Movehub Rating:</b> ", Movehub.Rating)
) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addScaleBar(position = "bottomleft") %>%
addLegend(
position = "bottomright",
colors = "#4CAF50",
labels = "Circle Radius (Movehub Rating)"
) %>%
addEasyButton(easyButton(
icon = "fa-info-circle", title = "Information",
onClick = JS("function(btn, map) { alert('This map displays the top 10 cities based on Movehub Rating, with circle markers representing Movehub Rating.'); }")
))
# Display the map
m
6. Geographical distribution of top 10 cities by Quality.of.Life:
# Order cities by Quality of Life and select top 10
top_10_cities <- combined_data[order(combined_data$Quality.of.Life, decreasing = TRUE), ][1:10, ]
# Create Leaflet map
m <- leaflet(top_10_cities) %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addCircleMarkers(
lng = ~LONG, lat = ~LAT,
radius = ~sqrt(Quality.of.Life) * 3, # Adjust circle marker size based on Quality of Life
color = "#2196F3", fillColor = "#2196F3", fillOpacity = 0.7,
popup = ~paste("<b>", City, "</b>", "<br>", "<b>Quality of Life:</b> ", Quality.of.Life)
) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addScaleBar(position = "bottomleft") %>%
addLegend(
position = "bottomright",
colors = "#2196F3",
labels = "Circle Radius (Quality of Life)"
) %>%
addEasyButton(easyButton(
icon = "fa-info-circle", title = "Information",
onClick = JS("function(btn, map) { alert('This map displays the top 10 cities based on Quality of Life, with circle markers representing Quality of Life.'); }")
))
# Display the map
m
7. Geographical distribution of the top 10 cities with the lowest Crime Rating:
# Order cities by Crime Rating and select top 10
top_10_cities <- combined_data[order(combined_data$Crime.Rating), ][1:10, ]
# Create Leaflet map
m <- leaflet(top_10_cities) %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addCircleMarkers(
lng = ~LONG, lat = ~LAT,
radius = ~sqrt(Crime.Rating) * 3, # Adjust circle marker size based on Crime Rating
color = "#FF5722", fillColor = "#FF5722", fillOpacity = 0.7,
popup = ~paste("<b>", City, "</b>", "<br>", "<b>Crime Rating:</b> ", Crime.Rating)
) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addScaleBar(position = "bottomleft") %>%
addLegend(
position = "bottomright",
colors = "#FF5722",
labels = "Circle Radius (Crime Rating)"
) %>%
addEasyButton(easyButton(
icon = "fa-info-circle", title = "Information",
onClick = JS("function(btn, map) { alert('This map displays the top 10 cities with the lowest Crime Rating, with circle markers representing Crime Rating.'); }")
))
# Display the map
m
Creating a new dataframe for modelling:
res <- combined_data[complete.cases(combined_data$LAT, combined_data$LONG), ]
Overall correlation:
# Overall correlation
num.cols <- sapply(res, is.numeric)
corrPLOT <- corrplot(cor(res[,num.cols]), method='ellipse', order="AOE")
- Movehub Rating has a negative correlation with Pollution and Crime, as one would expect
- The other features have a positive correlation with Movehub Rating
(better) Correlation:
# (better) Correlation
res <- data.frame(res %>% dplyr::select(-Quality.of.Life,-City,-Country))
num.cols <- sapply(res, is.numeric)
corrPLOT <- corrplot(cor(res[,num.cols]), method='number', order="AOE")
- The features most highly correlated with Movehub.Rating are Purchase.Power and Avg.Disposable.Income
- Gasoline does not show a strong correlation (to my surprise)
- Cappuccino (presumably the average price) is almost as important as Health.Care.
Overall, there are linear correlations between Movehub Rating and the other numerical features. So we may try as a first test a linear model.
Applying models to predict the Movehub Rating score:
# Train Test Split
set.seed(101)
split <- sample.split(res$Movehub.Rating, SplitRatio = 0.7)
train <- subset(res, split == TRUE)
test <- subset(res, split == FALSE)
# Function to plot the results of a given model and other information
plotModel <- function(mod) {
# Create dataframe with prediction and real values
mod.predictions <- predict(mod, test)
mod.res <- cbind(mod.predictions, test$Movehub.Rating)
colnames(mod.res) <- c('Predictions', 'True')
mod.res <- as.data.frame(mod.res)
# Make plots of residuals, etc.
g1 <- ggplot(data = mod.res, aes(x = Predictions, y = True)) +
geom_point() + xlim(50, 100) + ylim(50, 100) +
geom_abline(intercept = 0, slope = 1, color = 'red')
g2 <- ggplot(data = mod.res, aes(x = True - Predictions)) +
geom_histogram(bins = 50)
g3 <- ggplot(data = mod.res, aes(x = Predictions, y = True - Predictions)) +
geom_point()
# Arrange the plots
grid.arrange(g1, g2, g3, nrow = 2, ncol = 2)
# Calculate metrics
mse <- mean((mod.res$True - mod.res$Predictions)^2)
rmse <- mse^0.5
SSE <- sum((mod.res$Predictions - mod.res$True)^2)
SST <- sum((mean(test$Movehub.Rating) - mod.res$True)^2)
R2 <- 1 - SSE / SST
cat("MSE:", mse, " RMSE:", rmse, " R^2:", R2, "\n")  # cat() does not interpret %f format strings
}
-Linear model with all features:
# Applying linear Regression
linModel <- lm(Movehub.Rating ~ ., data = train)
summary(linModel)
Call:
lm(formula = Movehub.Rating ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-7.7935 -1.4489 -0.2631 1.2780 14.4159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.3390764 2.6577153 21.951 < 2e-16 ***
Cappuccino -0.2624010 0.5396183 -0.486 0.62761
Cinema 0.0206004 0.0449341 0.458 0.64741
Wine 0.1021572 0.1294251 0.789 0.43140
Gasoline 2.9019739 1.0225435 2.838 0.00529 **
Avg.Rent 0.0023670 0.0005704 4.149 6.06e-05 ***
Avg.Disposable.Income -0.0016842 0.0007950 -2.119 0.03607 *
Purchase.Power 0.3244978 0.0296638 10.939 < 2e-16 ***
Health.Care 0.0313351 0.0203895 1.537 0.12682
Pollution -0.0040804 0.0118137 -0.345 0.73037
Crime.Rating 0.0349714 0.0202210 1.729 0.08616 .
LAT -0.0161790 0.0129305 -1.251 0.21315
LONG 0.0095956 0.0046706 2.054 0.04198 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.045 on 127 degrees of freedom
Multiple R-squared: 0.801, Adjusted R-squared: 0.7822
F-statistic: 42.61 on 12 and 127 DF, p-value: < 2.2e-16
# Plotting the graph
plotModel(linModel)
MSE: 17.39767  RMSE: 4.171052  R^2: 0.6145516
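As a quick sanity check on the formulas inside plotModel, the same metrics can be reproduced by hand on a toy vector:

```r
true <- c(80, 85, 90)
pred <- c(78, 86, 91)

mse  <- mean((true - pred)^2)  # (4 + 1 + 1) / 3 = 2
rmse <- sqrt(mse)              # ~1.414
r2   <- 1 - sum((pred - true)^2) / sum((mean(true) - true)^2)  # 1 - 6/50 = 0.88

c(MSE = mse, RMSE = rmse, R2 = r2)
```

Note that this R^2 is computed on the held-out test set, which is why it is lower than the in-sample Multiple R-squared reported by summary(linModel).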
-Linear model with stepwise regression:
From the summary of the linear model, we see that not all features are significant. We can therefore try to improve the linear model by performing stepwise feature selection in both directions.
# Linear model with stepwise regression
step <- stepAIC(linModel, direction = "both")
Start: AIC=324.18
Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent +
Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution +
Crime.Rating + LAT + LONG
Df Sum of Sq RSS AIC
- Pollution 1 1.11 1179.0 322.31
- Cinema 1 1.95 1179.9 322.41
- Cappuccino 1 2.19 1180.1 322.44
- Wine 1 5.78 1183.7 322.87
- LAT 1 14.52 1192.4 323.90
<none> 1177.9 324.18
- Health.Care 1 21.91 1199.8 324.76
- Crime.Rating 1 27.74 1205.7 325.44
- LONG 1 39.15 1217.1 326.76
- Avg.Disposable.Income 1 41.63 1219.5 327.04
- Gasoline 1 74.70 1252.6 330.79
- Avg.Rent 1 159.70 1337.6 339.98
- Purchase.Power 1 1109.89 2287.8 415.12
Step: AIC=322.31
Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent +
Avg.Disposable.Income + Purchase.Power + Health.Care + Crime.Rating +
LAT + LONG
Df Sum of Sq RSS AIC
- Cappuccino 1 1.76 1180.8 320.52
- Cinema 1 2.68 1181.7 320.63
- Wine 1 5.99 1185.0 321.02
- LAT 1 16.41 1195.4 322.25
<none> 1179.0 322.31
- Health.Care 1 21.26 1200.3 322.81
- Crime.Rating 1 27.26 1206.3 323.51
+ Pollution 1 1.11 1177.9 324.18
- LONG 1 38.71 1217.7 324.83
- Avg.Disposable.Income 1 43.59 1222.6 325.39
- Gasoline 1 90.60 1269.6 330.68
- Avg.Rent 1 166.71 1345.7 338.83
- Purchase.Power 1 1132.63 2311.7 414.57
Step: AIC=320.52
Movehub.Rating ~ Cinema + Wine + Gasoline + Avg.Rent + Avg.Disposable.Income +
Purchase.Power + Health.Care + Crime.Rating + LAT + LONG
Df Sum of Sq RSS AIC
- Cinema 1 2.92 1183.7 318.87
- Wine 1 4.42 1185.2 319.04
<none> 1180.8 320.52
- LAT 1 19.59 1200.4 320.82
- Health.Care 1 21.29 1202.1 321.02
- Crime.Rating 1 26.81 1207.6 321.66
+ Cappuccino 1 1.76 1179.0 322.31
+ Pollution 1 0.67 1180.1 322.44
- LONG 1 38.55 1219.3 323.02
- Avg.Disposable.Income 1 57.38 1238.2 325.16
- Gasoline 1 89.15 1269.9 328.71
- Avg.Rent 1 164.99 1345.8 336.83
- Purchase.Power 1 1161.98 2342.8 414.44
Step: AIC=318.87
Movehub.Rating ~ Wine + Gasoline + Avg.Rent + Avg.Disposable.Income +
Purchase.Power + Health.Care + Crime.Rating + LAT + LONG
Df Sum of Sq RSS AIC
- Wine 1 7.99 1191.7 317.81
<none> 1183.7 318.87
- LAT 1 18.29 1202.0 319.01
- Health.Care 1 20.92 1204.6 319.32
- Crime.Rating 1 27.58 1211.3 320.09
+ Cinema 1 2.92 1180.8 320.52
+ Cappuccino 1 2.00 1181.7 320.63
+ Pollution 1 1.27 1182.4 320.72
- LONG 1 40.95 1224.7 321.63
- Avg.Disposable.Income 1 55.73 1239.4 323.31
- Gasoline 1 88.80 1272.5 326.99
- Avg.Rent 1 162.06 1345.8 334.83
- Purchase.Power 1 1183.33 2367.0 413.88
Step: AIC=317.81
Movehub.Rating ~ Gasoline + Avg.Rent + Avg.Disposable.Income +
Purchase.Power + Health.Care + Crime.Rating + LAT + LONG
Df Sum of Sq RSS AIC
<none> 1191.7 317.81
- LAT 1 20.02 1211.7 318.14
- Health.Care 1 21.71 1213.4 318.34
- Crime.Rating 1 23.06 1214.8 318.49
+ Wine 1 7.99 1183.7 318.87
+ Cinema 1 6.49 1185.2 319.04
+ Pollution 1 2.58 1189.1 319.51
+ Cappuccino 1 0.06 1191.6 319.80
- LONG 1 43.24 1234.9 320.80
- Avg.Disposable.Income 1 48.09 1239.8 321.35
- Gasoline 1 84.26 1276.0 325.37
- Avg.Rent 1 188.27 1380.0 336.34
- Purchase.Power 1 1214.75 2406.4 414.20
step$anova
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent +
Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution +
Crime.Rating + LAT + LONG
Final Model:
Movehub.Rating ~ Gasoline + Avg.Rent + Avg.Disposable.Income +
Purchase.Power + Health.Care + Crime.Rating + LAT + LONG
Step Df Deviance Resid. Df Resid. Dev AIC
1 127 1177.913 324.1800
2 - Pollution 1 1.106496 128 1179.020 322.3115
3 - Cappuccino 1 1.758833 129 1180.779 320.5202
4 - Cinema 1 2.924060 130 1183.703 318.8665
5 - Wine 1 7.988275 131 1191.691 317.8081
finalModel <- lm(Movehub.Rating ~ Cappuccino + Gasoline + Avg.Rent + Purchase.Power + Health.Care, data = train)
summary(finalModel)
Call:
lm(formula = Movehub.Rating ~ Cappuccino + Gasoline + Avg.Rent +
Purchase.Power + Health.Care, data = train)
Residuals:
Min 1Q Median 3Q Max
-9.2675 -1.8157 -0.3025 1.5552 14.7603
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.4341660 1.3299976 47.695 < 2e-16 ***
Cappuccino -0.6709330 0.4581963 -1.464 0.1455
Gasoline 1.4177586 0.7911495 1.792 0.0754 .
Avg.Rent 0.0022600 0.0004962 4.554 1.17e-05 ***
Purchase.Power 0.2552946 0.0171398 14.895 < 2e-16 ***
Health.Care 0.0296582 0.0206849 1.434 0.1540
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.194 on 134 degrees of freedom
Multiple R-squared: 0.7691, Adjusted R-squared: 0.7605
F-statistic: 89.26 on 5 and 134 DF, p-value: < 2.2e-16
# Plotting the graph of residuals for the final model
plotModel(finalModel)
MSE: 15.75842  RMSE: 3.969687  R^2: 0.6508697
-GLM model:
# Fit a glm model using your data
glm_model <- glm(Movehub.Rating ~ ., data = train, family = gaussian)
# Apply the plotModel function to the glm model
plotModel(glm_model)
MSE: 17.39767  RMSE: 4.171052  R^2: 0.6145516
# Comparison between all features and selected ones
anova(linModel, finalModel, glm_model)
Analysis of Variance Table
Model 1: Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent +
Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution +
Crime.Rating + LAT + LONG
Model 2: Movehub.Rating ~ Cappuccino + Gasoline + Avg.Rent + Purchase.Power +
Health.Care
Model 3: Movehub.Rating ~ Cappuccino + Cinema + Wine + Gasoline + Avg.Rent +
Avg.Disposable.Income + Purchase.Power + Health.Care + Pollution +
Crime.Rating + LAT + LONG
Res.Df RSS Df Sum of Sq F Pr(>F)
1 127 1177.9
2 134 1367.1 -7 -189.22 2.9144 0.007341 **
3 127 1177.9 7 189.22 2.9144 0.007341 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
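Because anova() is designed for nested models fitted by the same method (and glm_model here is just the gaussian refit of linModel), AIC() gives a cleaner side-by-side comparison. A self-contained sketch on toy data, not the real train set:

```r
# Toy stand-in for train: one informative and one noise predictor
set.seed(42)
d <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))

full    <- lm(y ~ x1 + x2, data = d)
reduced <- lm(y ~ x1, data = d)

AIC(full, reduced)  # lower AIC indicates the better fit/complexity trade-off
```

Applied here, `AIC(linModel, finalModel, glm_model)` would rank the three candidate models on the training data directly.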
Note: From all the models above, the lm and glm fits with all features did not achieve a better R^2 score than the stepwise linear regression (0.6508697 vs. 0.6145516).
------------------------------------------------------------
Thank you!
You can find the code and other data sets here: https://github.com/babluprasad70/R_Assignment